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We develop data structures for dynamic closest pair problems with arbitrary distance functions, 
that do not necessarily come from any geometric structure on the objects. Based on a technique 
previously used by the author for Euclidean closest pairs, we show how to insert and delete objects 
from an n-object set, maintaining the closest pair, in 0(n log 2 n) time per update and O(n) space. 
With quadratic space, we can instead use a quadtree-like structure to achieve an optimal time 
bound, O(n) per update. We apply these data structures to hierarchical clustering, greedy match- 
ing, and TSP heuristics, and discuss other potential applications in machine learning, Grobner 
bases, and local improvement algorithms for partition and placement problems. Experiments show 
our new methods to be faster in practice than previously used heuristics. 
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1. INTRODUCTION 

Hierarchical clustering has long been a mainstay of statistical analysis, and cluster- 
ing based methods have attracted attention in other fields: computational biology 
(reconstruction of evolutionary trees; tree-based multiple sequence alignment), sci- 
entific simulation (n-body problems), theoretical computer science (network design 
and nearest neighbor searching) and of course the web (hierarchical indices such as 
Yahoo). Many clustering methods have been devised and used in these applications, 
but less effort has gone into algorithmic speedups of these methods. 

In this paper we identify and demonstrate speedups for a key subroutine used in 
several clustering algorithms, that of maintaining closest pairs in a dynamic set of 
objects. We also describe several other applications or potential applications of the 
same subroutine, to TSP heuristics, greedy matching, machine learning, Grobner 
basis computation, and local optimization methods. 

Although dynamic closest pair data structures have been studied in low-dimcn- 
sional geometric spaces [Dobkin and Suri 1991; Eppstein 1995; Golin et al. 1998; 



Matias 199J ; pchwarz ct al. 199J ; [5mid 1992) ; [Bupowit 1990| , there has been little 
work on analogous structures in non-geometric spaces, or in spaces where the di- 
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mension is so high as to make taking advantage of the geometry difficult. However, 
there are several obvious approaches to this dynamic closest pair problem, ft can 
be solved by brute force (trivial recomputation) in time 0{n 2 ) per update with 
space 0(n), or by a priority queue of distances in time 0(n log n) per update and 
space 0(n 2 ). If we maintain the closest distance itself, and recompute all distances 
when we delete one of the two objects forming this distance, we can even achieve 
average-case time O(n) per update, in a model in which any deletion is equally 
likely. However, the applications we describe typically repeatedly delete the closest 
pair, making the performance of this naive algorithm much worse than its average 
case. 

Of these naive methods, brute force recomputation may be most commonly used, 
due to its low space requirements and ease of implementation. Three hierarchical 
clustering codes we exami ned, Zupan's [Zupan 1982 1, CLUSTAL W [Thompson 



ct al. 1994 1 , and PHYLIP [Fclscnstcin 1995 1 use brute force. (Indeed, they do not 



even save space by doing so, since they all store the distance matrix.) Pazzani's 
learning code [Pazzani 1997 1 also uses brute force (M. Pazzani, personal commu- 
nication), as does Mathematical s Grobner basis code (D. Lichtblau, personal com- 
munication). The clustering code listed by Anderberg [ Anderberg 1973| is perhaps 
more interesting: he uses a "nearest neighbor heuristic" in which one stores the 
index of each row minimum of the distance matrix (the nearest neighbor to each 
point), and only updates these indices when these minima change. Howev er, this 



method m ay still require 0{n 2 ) time per update in the worst case. Hartigan Harti 
gan 1975| describes the same nearest-neighbor heuristic, but resorts to brute force 



in the associated code listing. 

The purpose of this paper is to show that much better bounds are possible, using 
data structures that are simple and likely to be practi cal. We adapt and simplify 
a geometric closest pair data structure of the author Eppstein 1995[ to apply in 
our non-geometric setting, and show that it achieves nearly the best time and 
space bounds above: 0(n log 2 n) time per update and space 0(n). If linear space 
is required, this represents an order-of-magnitude speedup over known solutions. 
Further, with quadratic space, we can also improve significantly on the priority 
queue; we give an algorithm based on a quadtree-like structure in the distance 
matrix, with time 0(n) per update. This last bound is optimal, since in our model 
any algorithm needs to examine all n — 1 distances involving each newly inserted 
object. It remains open whether quadratic space is required to achieve linear time 
per update. 

Along with these theoretical results, we present experimental results on these 
data structures and some simple modifications of them. In all our experiments, 
all our new data structures are preferable to brute force, and one ("FastPair") is 
always preferable to the nearest-neighbor heuristic. The choice between it and the 
other new data structures depends on problem type and available memory. 

For recent geometric applications of similar closest pair data structures, in prob- 
lems of dynamic collision detection, offset curve construction, and skeletonization, 
see [ Eppstein and Erickson 1999 1. 
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2. MODEL OF COMPUTATION 

We assume a model in which we maintain a set of objects subject to insertions 
or deletions. We are also given a bivariate function d(s, t) measuring the distance 
between objects. This function need not satisfy the triangle inequality or other 
common properties of distances; indeed, in the Grobner basis application below 
the distances are not numbers. We assume only that function values are totally 
ordered. The task of our data structures is to maintain the pair s, t having the 
minimum value d(s,t) among all objects in the set. If two pairs have the same 
minimum value, our algorithms may return either pair. 

We assume that each object is stored in constant space, that the distance function 
can be evaluated in constant time, and that any two distances can be compared in 
constant time. These assumptions are not necessarily valid for all applications; for 



instance Cheng and Wallace [Cheng and Wallace 1993] describe an application of 



clustering to meteorology, in which the objects consist of very high dimensional vec- 
tors. In computational biology applications, objects may consist of long sequences 
of symbols, and distance evaluations may consist of complicated dynamic program- 
ming routines. In these cases our time bounds can be interpreted as numbers of 
evaluations; alternatively, with an additional 0(n 2 ) space, we can precompute and 
store the entire distance matrix. 

For the clustering applications we describe, we also assume some means of treat- 
ing clusters (sets of objects) as objects themselves, and of computing distances be- 
tween clusters. There is much freedom in determining distances between clusters. 
These distances need not be the same as the distances between objects, even for 



clusters consisting of single objects. Zupan Zupan 1982 1 describes seven different 
definitions of distance between clusters, each of which applies to arbitrary objects 
and distance functions, and each of which can be computed in constant time (with 
quadratic space to store all cluster distances) by a formula combining the distances 
between pairs of subclusters. For biological sequence data, distances between clus- 
ters may be computed by a multiple sequence alignment that respects previously 



computed alignments within each cluster [Corpet 1988; Gotoh 1994f] . Alternatively, 



distances may be defined by selecting a cluster member as a representative object 
or by combining objects to form a representative in some application-specific way 
(e.g., centroids for vector data; consensus sequences for biological sequence data). 
The distance between clusters would then be defined to be the distance between 
their representative objects. The multiple fragment heuristic for traveling salesman 
tours involves a similar idea in which each cluster is represented by two objects 
(at either end of the fragment) with the distance between clusters equal to the 
minimum of four distances between representative objects. 



3. CONGA LINE DATA STRUCTURE 



We now describe the dynamic closest pair data structure from Eppstein 1995], 
simplified somewhat by maintaining one set of objects instead of two sets, using a 
naive nearest neighbor searching technique in place of geometric range searching 
data structures, and relaxing size restrictions on subsets in a partition of the input. 

Our data structure consists of a partition of the dynamic set S into k < logn 
subsets Si, S2, ■ ■ • , Sfe, together with a digraph d for each set S,. Each digraph 
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Gi will consist of the union of a set of directed paths. Initially all objects will be 
in Si and G\ will have n — 1 edges. Gi may contain edges with neither endpoint in 
S%; if the number of edges in all graphs grows to 2n we rebuild the data structure 
by moving all points to Si and recomputing G\. As we will show below, the closest 
pair will be represented by an edge in some Gi, so we can find this pair by scanning 
the edges in all graphs. As we modify S, we create and merge these subsets Si and 
associated graphs Gi. This involves the following steps: 

Create Gi for a new partition Si.. When created, Gi will consist of a single path. 
We choose the first vertex of the path to be any object in Si. Then, we extend 
the path one edge at a time. When the last vertex in the partially created path P 
belongs to Si, we choose the next vertex to be its nearest neighbor in S \ P, and 
when the last vertex belongs to S \ Si, we choose the next vertex to be its nearest 
neighbor in Si \ P. We continue until the path can no longer be extended because 
S \ P or Si \ P is empty. 

Merge partitions. . The update operations described below can cause k to be too 
large relative to n. If so, we choose subsets Si and Sj as close to equal in size as 
possible: more precisely, if \Si\ < \Sj\, we choose these two subsets to minimize the 
size ratio ISjI/ISij. We then merge these two subsets into a single set and create 
the graph Gi for the merged subset as above. 

The construction of Gi is essentially the nearest neighbor TSP heuristic, however 
we are using it for a different purpose. The nearest neighbor searches performed 
when creating Gi can be done by a naive algorithm that tests all objects in S 
or in Si. Improvements can be made in geometric applications by applying mor e 



sophisticated range search techniques Eppstein 1995 ; Eppstein and Erickson 199S ] 



We are now ready to describe the update operations in this data structure. 

To initialize the data structure. Create a new subset Si containing all the initial 
objects, and create G\. 

To insert x. Create a new subset S^+i = {x} in the partition of S, create Gk+i, 
and merge partitions as necessary until k < log n. 

To delete x. Create a new subset Sk+i consisting of all objects y such that {y, x) 
is a directed edge in some Gi. Remove x and all its adjacent edges from all the 
graphs Gi. Create the graph Gk+i for Sk+i, and merge partitions as necessary 
until k < logn. 

Lemma 1. The data structure described above correctly maintains the closest 
pair in S . 

Proof. Let (s,t) be a closest pair, where s belongs to a subset Si created more 
recently than the subset containing t. Then when Gi was created, it contained s, 
so it contained at least one of (s,t). Then if s was the first of two vertices added to 
the path, it must have chosen as its neighbor either t or a vertex x at least as close 
to s. If it chose t, edge (s,t) exists in Gi. If it chose some x, then x can not have 
been deleted, since that would have caused s to move to a newer Sj, so {s,x) is 
at least as good as (s, t) and still exists in Gi. Similarly if t were chosen first then 
it would have formed edge (t, s) in Gi or (t, x) for some vertex x at least as close 
to t. Again, x could not have been deleted because that would cause t to move 
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to a subset Sj created more recently than Si. So in all cases Gi contains a closest 
pair. □ 

Lemma 2. The data structure above uses space 0(n). 

Proof. By construction, the graphs together have at most 2n edges (we rebuild 
the data structure if this bound is reached), so they take linear space to store. The 
partition is also easily maintained in linear space. □ 

Theorem 1. The data structure above maintains the closest pair in S in 0(n) 
space, amortized time 0{n log n) per insertion, and amortized time 0(nlog 2 n) per 
deletion. 

Proof. The correctness and space complexity have already been proven above; 
it remains to prove the time bounds. First, let us analyze the time for a sequence 
of updates that do not involve rebuilds to the data structure. 

We use a potential function argument. Define the potential of set Si to be 
n\Si\ log | Si |, and the potential of the whole data structure to be the sum of the 
potentials of each subset. This potential is at most n 2 logn (the value it would 
take for a partition consisting of a single set). The amortized time per operation is 
T + B — A, where T is the actual time used, B is the increase in the upper bound 
0(n 2 logn) on the potential, and A is the increase in the potential. Over the course 
of a sequence of operations, starting from a situation in which the potential equals 
B, the B and A terms in this formula telescope, so the total amortized time for the 
sequence is at most the total actual time; therefore this method provides a valid 
bound on amortized time. 

Each time we merge two subsets Si and Sj , the potential increases by 



A = n\ Si \ log + n|S,| log 

Since |S,| and \Sj\ must be within a factor of two of each other, the two loga- 
rithmic terms are constant and this simplifies to 0(n(|Si| + |Sj|)). Since the path 
constructed from the merged subsets has size 0(|Si| + \Sj\), and each edge in the 
path can be found in linear time, the total time for the merge is 0(n(\Si \ + |SjQ). 
Therefore any time spent performing merges can be balanced against an increase 
in the potential function. 

Each time we perform an insertion, we create a new set Sk+i with zero potential, 
and perform 0(n) work not counting mergers. However, the bound 0(n 2 logn) on 
the total potential increases by 0{n log n), and the amortized time for each insertion 
must also include this potential increase, so the total amortized time per insertion 
is O(nlogn). 

Each time we perform a deletion, we perform O(nlogn) work creating a new 
subset of at most log n objects. This work is balanced by a decrease of 0(n log n) 
in the upper bound on the total potential. Further, the new set Sk+i has some 
positive potential (up to 0(n logn log logn)). However, When we move these logn 
objects to a new set, the potential of each set Si decreases by ©(nlog|Si|) per 
object, and this potential decrease dominates the amortized time bound for each 
deletion, which is therefore 0(nlog 2 n). 
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To complete this analysis, we estimate the time spent rebuilding the data struc- 
ture. Dehne the excess of graph Gi to be |Gj| — 2|5j|. Initially, all points are in 
Si with a total excess of — n. Each time we merge two subsets, the merged graph's 
excess becomes nonpositive. The only way to create a positive excess is to move a 
point out of some Si , by deleting some other point sharing an edge with the moved 
point. Each deletion moves O(logn) points and thus increases the total excess by 
0(log n). Therefore, 0(n/ logn) deletions need to be performed before each rebuild 
and the amortized time per rebuild step is O(nlogra). □ 

4. QUADTREE DATA STRUCTURE 

We now describe a simple technique for maintaining the closest pair even more 
efficiently, if quadratic space is available. 

In a nutshell, we recursively subdivide the distance matrix into a quadtree, and 
maintain the smallest distance within each quadtree square. Each update affects 
0(n) squares along a row and column of the distance matrix, and we update the 
distances within each square by looking at each of its four children. 

In more detail, assume the objects are numbered xq, x\, . . ., x n -i- To maintain 
this numbering, when we insert a new object we give it the next highest number. 
When we delete an object X\ we renumber x n —i to have number i, so the numbers 
stay consecutive. (In practice, it may be possible to combine a deletion with a 
subsequent insertion, and avoid this renumbering step.) 

Define subsets consisting of a number of consecutive objects equal to a power of 
two: S(i,j) = {x l2] ,x a]+ i,. ■ ■ ,x {l+1 ) 2 i-i}- Equivalently, let S(i,0) = {xj and 
define S(i,j) for j > to be the disjoint union of S(2i,j — 1) and S(2i + 1, j — 1). 

Lemma 3. A set of n objects determines at most 2n — 1 distinct sets S(i,j). 

Lemma 4. There are n sets S(i,0), each consisting of a single object. Since 
each S(i,j) is the disjoint union of two smaller sets, the number of sets S(i,j) with 
cardinality 2 3 is at most half the number of sets with cardinality 2 J , so the total 
number of sets is at most n + n/2 + n/4 + • • • = 2n — 1. 

Define D(i,j,k) to be the minimum distance between a point in S(i,j) and a 
point in S(k,j). By the same reasoning as in the proof of the lemma above, the 
number of these values is at most n 2 /2 + n 2 /8 + n 2 /32 + • • • = 2n 2 /3. 

Lemma 5. Each insertion or deletion to the set of objects causes 0{n) values 
D(i,j,k) to change. 

Proof. We first consider insertion of a new object. This causes D(i,j,k) to 
change only when one of S(i,j) or S(k,j) contains the inserted object. Since for 
any j there is exactly one S(k,j) containing the inserted object, the changed values 
are in one-to-one correspondence with the sets S(i,j) not containing the inserted 
object. The result follows from Lemma ||. The argument for deletions is similar, 
except that the deletion of one object and renumbering of another causes roughly 
twice as many changes to D(i,j, k). □ 

Theorem 2. We can maintain the closest pair among a set of n objects in time 
0(n) per insertion or deletion, and 0(n 2 ) space. 
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Proof. As shown above, each update causes 0(n) changes to the values D(i, j, k) 
stored by the data structure. Each changed value can be recomputed in constant 
time, using the formula 

D(i,j, k) = mux{D(2i, 2k),D(2i+l,j-l, 2k),D(2i,j-l, 2k+l), D(2i+l,j-l, 2fc+l)}, 

if we perform the recomputation for smaller values of j before larger ones. The 
closest pair we seek is D(0, [log n] , 0). □ 

5. HIERARCHICAL CLUSTERING APPLICATION 

Hierarchical clustering is the process of forming a maximal collection of subsets of 
objects (called clusters), with the property that any two clusters are either disjoint 
or nested. Equivalently, it can be viewed as forming a rooted binary tree having 
the objects as its leaves; the clusters then correspond to the leaves of subtrees. See 



[Anderberg 1973; Duran and Odcll 1974; Hartigan 1975 



the extensive clustering litera ture. Although top-d own 



Zupan 1982 1 for surveys of 



Yianilos 1993 |, incremental 



||Zupan 1982 1, and numerical | Agarwalact al. 1998 ] hierarchical clustering methods 



are known, hierarchical clustering is often performed by a bottom up agglomerative 
approach. In agglomerative clustering, one defines a distance between pairs of 
clusters based on the distance between objects; then, starting with n single-object 
clusters, one repeatedly forms new clusters by merging the closest pair of clusters. 

Many variants of agglomerative clustering are known, largely differing in the defi- 
nition of cluster distances. This issue was discussed in more detail in our section on 
models of computation. For single-linkage distance, in which the distance between 
clusters is formed by the closest pair of objects, agglomerative clustering reduces to 
Kruskal's minimum spanning tree algorithm, and can be performed in 0(n 2 ) time 
and 0(n) space by instead applying Prim's or Boruvka's algorithm and sorting the 
MST edges. There has been some recent work on clustering in low-dimensional 
spaces [jKrznaric and Levcopoulos 199S ] or with Hamming distances on binary data 



[Aichholzer 1997]. But for cluster distances other than single linkage in more gen- 
eral data sets, no such speedups are known to the merging process defined above. 
Our data structures improve these clustering algorithms by allowing the nearest 
pair of clusters to be found quickly. 

Theorem 3. We can perform bottom-up hierarchical clustering, for any cluster 
distance function computable in constant time from the distances between subclus- 
ters, in total time 0(n 2 ). We can perform median, centroid, Ward's, or other 
bottom-up clustering methods in which clusters are represented by objects, in time 
0(n 2 log 2 n) and space 0(n). 

Proof. Each step in these clustering algorithms can be performed by finding the 
closest pair of clusters, deleting these two clusters from the set of objects represented 
by our closest pair data structure, and inserting a new object representing the new 
merged cluster. □ 

6. TRAVELING SALESMAN HEURISTIC APPLICATION 

Since the traveling salesman problem is NP-complete, but has many applications, 
a number of heuristics have been devised to approximately solve it. Some, such as 
the nearest neighbor heuristic (discussed above in connection with our low-space 
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closest pair data structure) and the double minimum spanning tree heuristic, can be 
solved easily in quadratic time and linear space (optimal in our non-geometric model 



of computation). However, Bentley Bentley 199C] has shown that these simple 



techniques are outperformed by other, seemingly harder to compute methods, such 
as the multiple fragment heuristic: consider all edges one at a time in sorted order, 
and include an edge if it connects the endpoints of two fragments of tours (connected 
components of previously added edges). 

Theorem 4. We can implement the multiple-fragment heuristic in time 0{n 2 ) 
or in time 0(n 2 log 2 n) and space 0(n). 

Proof. This can be seen as a type of hierarchical clustering, in which clusters 
correspond to fragments, and the distance between two clusters is the length of 
the shortest edge connecting their endpoints. The sequence of edges added by 
the hierarchical clustering algorithm of Theorem ^ is then exactly the same as the 
sequence added in the multiple fragment method. 

Alternatively, instead of maintaining the closest pair among a set of clusters, 
maintain the set of fragment endpoints, with distance +00 between endpoints of 
the same fragment. Each step of the algorithm then consists of selecting the closest 
pair, deleting one or both of these endpoints (if they belong to nontrivial fragments) 
and modifying the distance between the endpoints of the combined fragment. □ 



Another TSP heuristic, cheapest insertion [ Rosenkrantz et al. 1977 1, maintains a 
tour of a subset of the sites, and at each step adds a site by replacing an edge of 
the tour by two edges through the new site. Each successive insertion is chosen as 
the one causing the least additional length in the augmented tour. 

Theorem 5. We can implement the cheapest insertion heuristic in time 0(n 2 ) 
or in time 0(n 2 log 2 n) and space 0(n). 

Proof. We use our data structures to maintain a set of n objects: the k edges 
in the tour after the fcth insertion, and the n — k remaining uninserted sites. The 
distance between an edge and a site is defined to be the increase in length that 
would be caused by the corresponding insertion; all other distances are +00. In 
this way each successive insertion can be found as the closest pair in this set. □ 

For sites in a vector space or other set for which the distance between sites and 



edges is well defined, we can similarly implement nearest insertion \ Rosenkrantz 



et al. 1977f , which inserts the object closest to the current tour. How efficiently we 



can implement the farthest insertion heuristic remains unclear. 
7. GREEDY MATCHING APPLICATION 

The greedy matching of a set of points, with some distance function, is found by 
repeatedly selecting and removing the pair of points with minimum or maximum 
distance, depending on whether one wants a minimum- or maximum- weight match- 



ing. This technique was introduced by Reingold and Tarjan [Eleingold and Tarjan 



1981], who noted that greedy matchings could be constructed in 0(n 2 log n) time 
by sorting the set of distances. Since that paper there has been no improvement in 
the time bounds for greedy matching. 
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Greedy matching is not a particularly good approximation to the minimum 



weight matching [Reingold and Tarjan 1981], even in the average case for one 



dimensional points [Frieze et al. 199C]. However, for maximum weight matching 
with non-negative inter-object distances, greedy matching comes within a factor of 
two of optimal, and may provide a good starting point for augmenting-path based 
techniques for finding optimal matchings. 

Greedy matching may also be appropriate for non-numeric distances for which 
addition is undefined, since it lexicographically minimizes or maximizes the set of 
edge weights in the matching. 

Theorem 6. We can perform greedy matching in time 0(n 2 log 2 n) and space 
0(n), or in time 0(n 2 ). 

Proof. We use the data structures defined above to repeatedly find and delete 
the closest pair. □ 

8. OTHER APPLICATIONS 

We now discuss some other potential applications of our data structures, in which 
the savings they achieve are less easy to quantify. 

8.1 Grobner Bases 

We first consider the problem of computing Grobner bases for polynomial ideals. 
Buchberger's Grobner basis algorithm is a key component of many symbolic algebra 
systems and has found a large number of applications including computational 



geometry and robotics Buchbcrger 1987 1, automated deduction [Clegg et al. 1996] 



and combinatorial enumeration Sturmfels 1996 1. This algorithm takes as input a 



set B = {/i, /2, . . .} of polynomials and a term ordering for comparing monomials, 
and proceeds to modify B in a sequence of steps, in which S -polynomials 5(/i, fj) 
are constructed and added to B, and polynomials in B are simplified by subtracting 
multiples of each other. As the algorithm proceeds, B can grow very large, so space 
efficiency is crucial. Further, the choice of which ^-polynomial to form can make a 
large difference in the algorithm's efficiency. For this reason, many implementations 
follow a suggestion of Buchberger to use the normal selection strategy (e.g. see 



| Adams and Loustaunau 1994 , p. 130]): select fi and fj for which the least common 
multiple of the leading terms of fi and fj is as small as possible in the term ordering 



(Oth er selection strategies have also been proposed Czapor 1991 ; Giovini et al 



1991 1 and it seems likely that our methods apply as well to them.) 

We can easily apply our closest pair data structures to maintain the set B and 
select the appropriate pair /,;, fj. Distances between members of B can be measured 
by least common multiples of leading terms; these values, although non-numeric, 
can be compared by the term ordering. One complication arises, however: once we 
have processed S(fi,fj), we do not want to select the same pair again. So, some 
data structure such as a hash table should be used so we can test whether this 
S(fi, fj) has already been computed, and if so modify the distance between fi and 
fj to Too. Such a modification can performed as efficiently as an insertion in our 
linear-space data structure: simply move fi and fj to a new subset in the partition 
of the objects maintained by the data structure. In our quadtree data structure, 
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no hash table is needed and modification of a single distance is even easier, taking 
time O(logn). 

However, pair selection forms a small part of the runtime of Buchberger's algo- 
rithm (D. Lichtblau, personal communication) so improvements would likely have 
to be made elsewhere to make it worthwhile to implement our data structures for 
this application. 

It may also be of interest to consider applying our techniques to other pair- 
combination methods of automated deduction such as resolution-based theorem 
proving. 

8.2 Constructive Induction 

A second potential application arises in machine learning. Constructive induction 
is a technique for synthesizing new attributes for multi-attribute data, by combin- 
ing pairs of attributes. This method can be used to enhance learning methods that 
can not represent such combinations directly, or that are based on an assumption 
of attribute independence that may not hold in the actual input. For example, 



Pazzani Pazzani 1997] forms new attributes from Cartesian products of pairs of 
discrete- valued attributes, and demonstrates improvements to the learning abilities 
of Bayesian and nearest-neighbor classification systems. In Pazzani's experiments, 
each new product attribute is chosen greedily, as the one that leads to the biggest 
improvement as measured by leave-one-out cross-validation. Such greedy pairwise 
combination again seems a natural application for our data structures, but we can 
only apply them if the quality of an attribute combination stays stable while we 
insert or delete unrelated attributes. According to Pazzani (personal communica- 
tion), this stability does hold in practice. 

8.3 Non-Hierarchical Clustering 



Duran and Odell [Duran and Odcll 1974, p. 23] describe a non- hierarchical clus 



tering procedure due to Ball and Hall [Ball and Hall 1965], to which our methods 



may also apply. In this procedure, a clustering is improved by repeatedly merging 
the closest pair of clusters (measured by average squared distance) and splitting 
the cluster with the highest variance. Clearly, our data structures can be used for 
the merge step, but it is not clear whether this is a significant part of the overall 
complexity of the algorithm, which also includes "if -means" -like phases in which 
clusters are reconstructed by moving objects to the nearest cluster centroid. 

8.4 Local Optimization 

Local search procedures such as two-optimization are a common method for im- 
provement of heuristic solutions to optimization problems such as parts placement, 
traveling salesman tours, or graph partitioning. In these procedures, one mod- 
ifies a suboptimal solution by moving a small number of objects; the "two-" in 
two-optimization refers to the number of objects moved. So, for instance, in graph 
partitioning, one maintains a correct partition while improving the number of cross- 
ing edges, by swapping one vertex on one side of the partition for a vertex on the 
other side; in the traveling salesman problem, one maintains a correct tour while 
improving its length by removing two edges and replacing them by two other edges 
connecting the same four vertices. Our methods can likely be used in some of these 
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problems, to maintain the pair of objects the replacement of which leads to the 
greatest improvement in the objective function. 

However, in practice, local optimization is often combined with techniques such as 
simulated annealing, which randomly selects local changes and allows the objective 
function to become worse in an attempt to escape local minima. It is not clear 
how techniques for maintaining the best local improvement should be combined 
with this simulated annealing approach. Further, application of our ideas to e.g. 
TSP two-optimization is complicated by the fact that only one of the two ways of 
replacing a pair of edges will lead to another valid tour; it is not clear whether our 
data structure can be modified to deal with this additional complication, or with 
similar complications arising in other problems. 



9. IMPLEMENTATION AND EXPERIMENTS 
9.1 Algorithms Implemented 

In order to test our data structures, we implemented them in a testbed of four algo- 
rithms: greedy matching, the multi-fragment TSP heuristic, the cheapest insertion 
TSP heuristic, and hierarchical clustering by unweighted medians (UPGMA). 

We implemented several methods for generating random objects: uniformly dis- 
tributed vectors with various distance functions (including dot product as well as 
the more familiar L\, L2, and Loo metrics), hierarchically clustered points (via a 
generalization of the Sierpinski tetrahedron fractal), random leaves of a large binary 
tree, and random distance matrices. Each object generator allowed all distances to 
be negated, forming a maximization rather than minimization problem. 

The data structures we implemented included our own conga line and quadtree 
methods, brute force , and the "neare st neighbor heuristic" suggested by authors 



including Anderberg [ Anderberg 1973 1. In this method, we store each point's near- 
est neighbor; closest pairs can be found by scanning this list of neighbors. Insertions 
can be performed in O(n) time by computing the nearest neighbor to the inserted 
point and testing whether it should become the nearest neighbor of any other point. 
However, when a point is deleted, any other point for which it is nearest neighbor 
must find a new neighbor; if the deleted point was neighbor to k other points, the 
neighbor heuristic takes time 0{kn). If deletions are random or the points belong 
to a low dimensional metric space, k = O(l) and the time per update is 0(n), but 
it is not hard to find examples in which the worst case time per update is 0(n 2 ). 
We did not implement the priority queue method due to its complexity, high space 
usage and expected poor performance. 

In all the methods we implemented, nearest neighbors were computed by a naive 
sequential scan through all points. In many applications, nearest neighbors can 
be computed more quickly by heuristics such as spiral search; however we did not 
implement this due to its complexity. We believe that faster searching would equally 
speed up brute force, the neighbor heuristic, and our conga line based methods, 
so adding such heuristics should not change our overall experimental conclusions 
except possibly by making the quadtree method (which can not use fast neighbor- 
finding methods) appear worse. 
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Fig. 1. Hierarchical clustering in R. . Points were placed uniformly at random in the unit 
hypercube and clustered by unweighted medians. 

9.2 Simplified Data Structures: MultiConga 

Our conga line implementation includes a parameter k for the number of subsets 
into which to partition the objects. For best results in our theoretical analysis, k 
should be O(logn); our implementation's default is k = log 2 n. Our initial expec- 
tation was that multiplying this default by a small constant might lead to small 
improvements, but that non-logarithmic values would cause the time to blow up. 
To our surprise, the data structure became faster as k grew very large, until the 
number of distance computations stabilized but the overhead of maintaining many 
subsets slowed down the structure. 

In retrospect, we can explain this heuristically: if k is large, we do few merges 
of existing subsets, reducing the amortized time per insertion. Although the worst 
case number of points moved to a new subset by each deletion is k, the expected 
number (if any deletion is equally likely) is 0(1) regardless of the number of subsets, 
so increasing k could typically cause less harm than our worst case analysis would 
suggest. 

With this experience and heuristic justification, we decided to try a modified 
version of the conga line structure, which we call the "multiple-subset conga line" 
or "MultiConga" for short. In this structure, we simply never merge subsets Si] 
instead, whenever an insertion or deletion creates a new subset we let the total 
number of subsets grow. More formally, we modify the conga line data structure 
operations as follows. 

To initialize the data structure. Create a new subset Si containing all the initial 
objects, and create G\. 

To insert x. Create a new subset Si = {x} in the partition of 5, and create Gi 
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Fig. 2. Hierarchical clustering in a 31-dimcnsional fractal. Points were placed uniformly at 
random in a generalized Sierpinski tetrahedron formed by choosing 5 random binary values and 
taking bitwise exclusive ors of each nonempty subset, and clustered by unweighted medians. 

(consisting of a single edge from x to its nearest neighbor) . 

To delete x. Create a new subset Si consisting of all objects y such that (y,x) 
is a directed edge in some Gj. Remove x and all its adjacent edges from all the 
graphs Gj. Create the graph Gi for Si. 

In our experiments, this variant of our closest pair data structure was usually 
faster than the original conga line data structure, sometimes much faster than the 
neighbor heuristic, and only rarely slightly slower than the neighbor heuristic. We 
can provide theoretical evidence for its speed: 

Theorem 7. The MultiConga method described above correctly maintains the 
closest pair in amortized time Oin) per insertion and 0(n 3 ^ 2 ) per deletion. 

Proof. Correctness follows from the correctness of the conga line data structure. 
To prove the time bound, we use a potential function <p — J2i ISi^n 1 / 2 . Each 
insertion changes this potential by n 1 / 2 and takes time 0(n). Each deletion in 
which k points are moved to a new subset takes time 0(kn), but increases <p by 
k 2 n x / 2 — (9(n 3 / 2 ). For any fc, the amortized time (actual time minus difference in 
ip) is 0(kn- k 2 n x l 2 + n 3 / 2 ) = 0{n 3 / 2 ). □ 

Although one can concoct examples in which this worst-case bound is tight, we 
did not find any natural problem for which this method achieved its worst case. 

9.3 Simplified Data Structures: FastPair 

Since the expected number of points moved into a cluster on each deletion is small, 
we decided to try a further simplification. In the "FastPair" method, like Mul- 



14 ■ D. Eppstein 



n=500 2000 8000 




Fig. 3. Greedy matching of points placed uniformly at random in the unit hypercube in R, . 

tiConga, we never merge subsets. But further, in the case that a deletion would 
cause k points to move from their current subsets to a new subset, we instead form 
k singleton subsets. More formally, we have the following two operations: 

To initialize the data structure. Create a new subset Si containing all the initial 
objects, and create G%. 

To insert x. Create a new subset Si = {x} in the partition of S, and create Gi 
(consisting of a single edge from x to its nearest neighbor) . 

To delete x. Create a separate new subset Si = {y} for each object y such that 
(y,x) is a directed edge in some Gj. Remove x and all its adjacent edges from all 
the graphs Gj . Create the graph Gi for each newly created subset Si . 

The advantage of this data structure compared to the previous ones is that each 
object x has an outgoing edge to a neighbor only within the set to which it belongs. 
Therefore, the actual data stored in the structure need only consist of the weight 
of this outgoing edge and the identity of the neighbor it points to. The partition of 
the objets into subsets Si does not need to be stored explicitly, as it is not used by 
the update operations. In addition, the number of edges in the structure is always 
at most ro, so we need not worry about rebuilding when the number of edges grows 
too large. 

This FastPair data structure closely resembles the nearest neighbor heuristic, 
in which again each point remembers a single neighbor. However in the FastPair 
heuristic the stored neighbor may not always be nearest. In the initial construction 
of the data structure, instead of computing nearest neighbors for each point, we 
construct a single conga line, in order to maintain some control over the number 
of incoming edges per object. And, when inserting a new point, we compute its 
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Fig. 4. Greedy matching of points with pseudorandom distances. The distance between two 
points was computed by using their indices to modify the seed for the drand48 random number 
generator. 

nearest neighbor as before, but we do not change the stored neighbors of other 
points even if the newly inserted point is nearer than these stored neighbors. Like 
the nearest neighbor heuristic, the FastPair method takes linear expected time for 
random deletions, but has a quadratic worst case. In our experiments, FastPair 
was always faster than the neighbor heuristic. 

9.4 Experimental Results 

Log-log charts of timing results from our computational experiments are presented 
in Figures 1-9. In the figures, "BF" stands for the brute force method, "NH" for 
the neighbor heuristic, "QT" for our quadtree method, "CL" for the basic conga 
line method, "MC" for MultiConga, and "FP" for FastPair. The times include 
only the construction of the closest pair data structure and algorithm execution 
(not initial point placement) and are averages over ten runs. The algorithms were 
implemented in C++, compiled and optimized by Metrowerks Codewarrior 10, and 
run on a 200MHz PowerPC 603e processor (Apple Powerbook 3400c). The quadtree 
data structure was limited to 1000 points by its high memory requirements; other 
data structures were tested up to the point where a larger input would not fit 
comfortably into an overnight test run. 

We ran one representative application (greedy matching) using a variety of dis- 
tance functions, and ran a selection of other applications on two distance functions 
for which our data structures exhibited strikingly different qualitative behavior (Eu- 
clidean closest pairs for uniformly generated points in R 20 , and rectilinear farthest 
pairs for uniformly generated points in the unit square) . For hierarchical clustering, 
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Fig. 5. Greedy maximum- weight matching of points placed uniformly at random in the unit 
square, with the L\ metric. 

we also ran a further test on a point set with a fractal structure, to test whether 
the behavior we observed on uniform points could be assumed to hold also for more 
realistic clustered data. 

Each application performed linearly many updates, so linear time per update 
translates to quadratic total time in our tests, and quadratic time per update 
translates to cubic total time. Asymptotic runtime can be estimated by examining 
the change in running time when doubling the problem size; if the time increases 
by a factor of four, it can be estimated as quadratic or nearly quadratic, while if 
the time increases by a factor of eight, it can be estimated as cubic. Due to caching 
and other issues, it was common for times to increase by factors larger than the 
theoretical worst case bound, but in general all experiments gave results consistent 
with cubic or quadratic runtimes. 

As expected, brute force always gave cubic runtimes. The neighbor heuristic was 
often quadratic, but on some problems was cubic, even sometimes slower than brute 
force. FastPair was also sometimes quadratic, and sometimes cubic; however it was 
the only method to consistently run faster than the neighbor heuristic (sometimes 
by a linear factor). The remaining methods always exhibited quadratic behavior 
(although MultiConga could theoretically have a slower worst case) but sometimes 
differed by factors of three or more in total runtime. The quadtree method was 
surprisingly slow; although it performed few distance computations, it was generally 
only faster than other methods for problems with expensive distance computations. 
The basic conga line method was often slower by a factor of three to five than its 
simplifications, and on some problems this factor seemed to be increasing with n, 
perhaps showing that the logarithmic factors in its theoretical time bound were 
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Fig. 6. Cheapest insertion heuristic for TSP of points placed uniformly at random in a 20- 
dimensional hypercube. 



active in practice. 

Our conclusion would be to use the quadtree method for problems with few points 
and slow distance computations; to use FastPair for most applications (after testing 
to verify that it behaves well for the given application) and to use MultiConga or 
occasionally the original Conga Line structure when FastPair is known to behave 
poorly or when a more robustly fast method is required. 

The problem of caching remains interesting. The methods we tested involve 
sequential scans through memory, a behavior known to reduce the effectiveness of 
cached memory. Some effects of this appear in our data; for instance the last two 
rows of the brute force data structure for most expensive rectilinear insertion exhibit 
a jump in runtime by a factor of 15, much higher than the factor of 8 indicated by 
the asymptotic analysis. We believe that this jump is due to exceeding the limits 
of the 32Kbyte level I cache on the 603e processor; other jumps can be attributed 
to exceeding the Powerbook 3400's 256K level II cache. 

The fact that the original Conga Line data structure is slower than MultiConga 
and FastPair for rectilinear greedy maximum matching with moderate input sizes, 
even though it performs fewer distance computations, may be due to its higher 
space usage causing poor caching behavior; note that the other two methods become 
slower than it only on problem sizes at which they too exceed the cache. Perhaps 
the relatively poor performance of the quadtree method is also due to its high 
memory usage. It would be of interest to develop more cache-efficient closest pair 
data structures which take better advantage of modern computer memory systems. 
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Fig. 7. Most expensive insertion heuristic for MAXTSP of points placed uniformly at random in 
the unit square, with the L\ metric. 
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